Availability: Adds logic to avoid bad replica during cache refresh #3127
Pull Request Template
Description
Current design:
If the SDK gets a 410 or another failure signaling that a replica has moved to a different machine, an address cache refresh is triggered. Until the new address list is returned, the cache keeps serving the stale information, because the 3 other replicas should still be valid and can complete requests. This still leaves a 25% chance (1 of 4 replicas) that a new request goes to the bad replica, where the connection can take multiple seconds to time out.
The solution:
Individual addresses in the GatewayAddressCache have an unhealthy flag. When a cache refresh is requested, the bad replica is marked as unhealthy. When the SDK picks a random replica, it always moves the unhealthy replicas to the end of the list. When the results from the gateway return, the health state is reset, so the replica is only avoided during the call to get the new addresses from the gateway.
The "unhealthy" state is reset when a Gateway refresh response comes back (whether addresses changed or not) or after 1 minute, whichever comes first. So the throughput SLA regression risk (temporarily using only 3 out of 4 replicas) applies for at most 1 minute.
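The behavior described above can be illustrated with a minimal Python sketch. It is not the SDK's code: `AddressCacheSketch`, `ReplicaAddress`, `mark_unhealthy`, `on_gateway_refresh`, and `pick_order` are hypothetical names chosen for illustration, and the only values taken from this description are the 1-minute expiry and the rule that unhealthy replicas move to the end of the randomized list.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

# "after 1 minute" from the description; the constant name is an assumption.
UNHEALTHY_TTL_SECONDS = 60.0


@dataclass
class ReplicaAddress:
    uri: str
    unhealthy_since: Optional[float] = None  # None means healthy

    def is_unhealthy(self, now: float) -> bool:
        # The flag expires on its own once the TTL has elapsed.
        if self.unhealthy_since is None:
            return False
        return (now - self.unhealthy_since) < UNHEALTHY_TTL_SECONDS


class AddressCacheSketch:
    """Hypothetical stand-in for the address cache, for illustration only."""

    def __init__(self, uris: List[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._addresses = [ReplicaAddress(u) for u in uris]

    def mark_unhealthy(self, uri: str) -> None:
        # Called when a failure triggers a cache refresh for this replica.
        now = self._clock()
        for address in self._addresses:
            if address.uri == uri:
                address.unhealthy_since = now

    def on_gateway_refresh(self, uris: List[str]) -> None:
        # A gateway response resets health state, whether or not
        # the addresses actually changed.
        self._addresses = [ReplicaAddress(u) for u in uris]

    def pick_order(self) -> List[str]:
        # Shuffle for random selection, then stable-sort so that
        # unhealthy replicas land at the end of the list.
        now = self._clock()
        shuffled = list(self._addresses)
        random.shuffle(shuffled)
        ordered = sorted(shuffled, key=lambda a: a.is_unhealthy(now))
        return [a.uri for a in ordered]
```

Because the sort is stable, the random order is preserved within the healthy group, so requests still spread across the remaining replicas while the flagged one is only tried last.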
// Cache refresh design
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber